PRELIMINARIES

{MASS}, {car}, {curl}, {ggplot2}, {devtools}, {ggord}, {plotly}, {klaR}, {FactoMineR}, {lattice}, {Scatterplot3d}
An easy way to install ggord is to run the following:

library(devtools)
install_github("fawda123/ggord")
## Skipping install of 'ggord' from a github remote, the SHA1 (c92d629d) has not changed since last install.
##   Use `force = TRUE` to force installation

The link to this vignette (https://github.com/ilundeen/DFA)

OBJECTIVES

In this module we will examine Discriminant Function Analysis, a pattern-recognition statistical method that characterizes or separates two or more classes of objects or events. We will go over the different uses of DFAs, work through a basic example by hand, and will end with running an analysis together using actual data and function within packages in R.
Key concepts: Variance & covariance, Discriminant scores, Comparison to PCA & MANOVA

THE BASICS

Discriminant Function Analysis (DFA) is a populate statistical method used to determine which continuous independent variables discriminate between two or more categorical variables in such a way that the differences between predefined groups are maximized.

Last week we learned about Principal Component Analysis, a technique to describe the relationship and structure in data. We saw how useful it can be in finding the correlations between variables in multivariate data.
Discriminant function analysis is very similar to PCA. The major difference is that PCA calculates the best discriminating variables without prior knowledge about groups, whereas DFA calculates the best disciminating variables (=disciminants) for groups that are defined by the user.

DFA’s are good tools for:

  • Visualizing multivariate data
  • Understanding the most meaningful sources of variance in your data
  • Understanding the best descriminating variables in your data, with prior knowledge about groups
  • Predicting categorical groups
  • Determining which variables best predict groups
  • Reducing the changes of misclassifying your unknown data

Example in biology: you might use a DFA to determine which variables discriminate between leaves eaten by 1) sloths 2) howler monkeys 3) leaf ants. For that purpose, the biologist would collect data on numerous leaf characteristics of those species eaten by each of the animal groups. Ideally, most leaves will naturally fall into one of these three categories. DFA could then be used to determine which variables are the best predictors of whether a leaf will be eaten by sloths, monkeys, or ants.

DFA is like the reverse of a MANOVA, which is used to predict multiple continuous dependent variables by one or more independent categorical variables. Instead, DFA is a powerful predictor tool, and useful in determining whether a set of variables is actually effective in predicting categories. This means that DFA can only be used when groups are known a priori. You can think of DFA as a classification, with the proper dataset, DFA can distribute things into groups of the same type.

Discriminant analysis works by producing one or more linear combinations of predictor variables, creating a new underlying variable for each function. These functions are called discriminant functions. The number of possible functions is the number of groups minus one (ng - 1), or the number of predictors whichever is smaller. The first discriminant function, similar to the first principal component, maximizes differences between group on that function. The second function does the same, but also must not be correlated with the previous function, this continues for the rest of the functions.

Example on the board

Here is an animation showing how some projections of data are able to discriminate data better than others:

DEFINITIONS

Variance: A measure (similar to standard deviation) of how much spread, or range, is present in your dataset.
Covariance: A measure of how variables vary from the mean with respect to each other.
Variance-Covariance Matrix: Variance within each variable on the diagonal, and the covariance between groups is vertical
Discriminant score: The weighted linear combination (sum) of the discriminating variables or predictors.
Discriminant coefficient: The correlation between each predictor variable and the discriminant score of each function.
Centroid: The global means.
Training dataset: Dataset of variables with known groupings that is used to generate a model.
Linear discriminant function analysis: Method most often used, DFA when all the assumptions are met. AKA Linear Discriminant or Canonical Discriminant Function…
Fisher’s linear discriminant analysis: DFA where there are only two groups.
Quadratic discriminant function analysis: Used when you can’t meet assumptions for linear, i.e. Homogeneity of variances/covariances
Jackknife: A resampling technique (similar to bootstrapping) without replacement. In jackknifing, one sample is removed from your data, a discriminant model is calculated based on the remaining data, then your removed sample is reclassified based on the discriminant model produced without that sample. This is repeated for all your samples to get a jackknifed discriminant model of your entire dataset.


“Bootstraps bootstraps” - The scene in Pirates of the Caribbean, the movie about statistics on the high seas, where William Turner learns about a common resampling method - bootstrapping.


“No, not that knife, Will!” - The scene in Pirates of the Caribbean where Professor Jack Sparrow, the swashbuckling statistician, teaches young Will Turner about another common resampling method - jackknifing.

ASSUMPTIONS

The assumptions of DFA are the same as those for MANOVA

Sample Size: Sample size does not have to be equal. However, the sample size of the smallest group must be greater than the number of predictor variables.

Normal Distribution: DFA assumes the data represent a sample from a multivariate normal distribution.

Homogeneity of variances/covariances: DFA is sensitive to heterogeneity of variance-covariance matrices.

Outliers: DFA is also very sensitive to the inclusion of outliers. It is recommended to run a test for univariate and multivariate outliers for each group, and elimate or transform the outliers.

Non-multicollinearity: The independent variables should not be highly correlated with another. Also must be low multicollinearity of the independents.

MATHEMATICAL UNDERPINNINGS

In many cases—including those we have examined in class thus far—our research goal is to determine the effect of group membership on multiple dependent measures. In this case, the groups or factors, serve as independent or (if found to be significant) predictive variables, which allow us to (1) test for statistically significant differences between/among groups and (2) determine which of the dependent measures are most significantly correlated with or impacted by group membership.

MANOVA (Multivariate Analysis of Variance)—single-dichotomy/polytomy-multiple dependent variable
• Controls the experiment-wise alpha level (necessary when using multiple dependent variables)
• Allows you to determine which variables are most effected by the manipulation: What does this independent variable affect

An opposite approach

Discriminant Function Analysis

Question 1: Can we discriminate among groups, and if so, how well?
Question 2: Which combination of variables (a.k.a., canonical variate) best discriminates among groups?
Opposite approach: here we are using regression weights to form a composite dependent variable, and then using multiple regression formulas to assess the significance of group separation.

Put another way our factors (dichotomy/polytomy) have now become the dependent variables and our measurements have become our independent/predictive variables.Hence:
* The new axis created via DFA is a linear composite of the observed variables composed in an attempt to minimize the overlap in the distributions of the observed variables.
* This axis represents an approximate discriminant or canonical variate, a function in which the observed variables are weighted based on discriminant coefficients.

Distances between/among groups:

Euclidean Distances (i.e., straight line)

  • How far are different points from the origin?
  • Closest does not always correlate with ellipses (standard deviations).

Mahalonobis Distance

  • Takes into account the structure of the variance-covariance matrix (means of standardization).
  • Tells you the distance of a data point to the middle of a data cloud.
  • Can also be used to look between/among groups (i.e., distance between a point and a specific group mean); if there is a true group difference, we expect that data of group X will be closer in shape space to the mean of X than mean of Y or Z.

Calculating discriminant scores (mostly) by hand:

zih: discriminant/canonical score of individual i of group h
xih1: score of individual i on first observed variable
xih2: score of individual i on second observed variable
w1 and w2: Discriminant Coefficients (i.e., weights that define the discriminant/canonical variate in terms of the two observed variables)

More generally:

Let’s start with a simple example in which we have two continuous variables being measured on seven individuals. These individuals have been assigned to either group 1 or group 2.
As previously mentioned, the subscripts i and h refer to individual observation and group designation, respectively. Here we include an additional subscript, j, which refers to which of the two variables we are referencing.
Hence, in matrix X, the two columns represent the two variables being measured (j1 and j2) and each row represents the values for each of those variables per individual, i.

library(curl)
dfa <- read.csv(curl("https://raw.githubusercontent.com/ilundeen/DFA/master/dfaexample.csv"), 
    header = T)
library(MASS)
library(ggplot2)
plot <- ggplot(dfa) + geom_point(aes(j1, j2, colour = Group), size = 2.5)
plot

The next step is to divide matrix X into two separate matrices based on group:

The next step is to calculate the mean vectors for each group separately, as well as the global mean vector:

Using these vectors we can also calculate d vector (which we will need later on) by calculating the differences in group means:

Calculating the Variance-Covariance Matrix

Before we can calculate our VCV matrix, we must standardize our values by subtracting the global mean vector from each set of observations:

Brief review - what is a VCV matrix?

Each matrix is then multiplied by its posterior probability (in this case h1 = 4/7 and h2 = 3/7), and the two matrices are added together. This results in a pooled VCV matrix that encompasses the observations from the entire dataset. Next we compute the inverse of VCV matrix.


NOTE: If anyone feels compelled to manually calculate the inverse of a matrix, here are two very helpful links: Here & Here

However, there is also a quick and easy R function that will do this for you: solve(A)
This inverse matrix is then multiplied by the d vector (calculated previously) to compute our w vector, the values of which will be used to weight the variables in the original equation to calculate the discriminant score, z.

Alternatively, the matrix inversion and vector multiplication can be accomplished using one simple function in R. Assuming you have read in your VCV matrix A and your mean vector d, you can use the same ‘solve()’ function; in this case it would be inputted as: solve(A,d)

A <- matrix(c(0.206, -0.233, -0.233, 1.689), nrow = 2, ncol = 2)
d <- matrix(c(0.38, 1.65), nrow = 2, ncol = 1)
solve(A, d)
##          [,1]
## [1,] 3.494934
## [2,] 1.459041

CAMELID EXAMPLE

Davis and McHorse, 2013

Palaeontologia Electronica

A method for improved identification of postcrania from mammalian fossil assemblages: multivariate discriminant function analysis of camelid astragali.
Aepycamelus, one of these North American camelids

library(curl)  #needed to read in csv file from github
camels <- read.csv(curl("https://raw.githubusercontent.com/ilundeen/nothing-to-see-here/master/camel.csv"), 
    header = TRUE)
# Read in the .csv file
head(camels)
##    Mus                                               Loc
## 1 AMNH ""Alachula Clays"" Mixon's Bone Bed, Levy Co., FL
## 2 AMNH ""Alachula Clays"" Mixon's Bone Bed, Levy Co., FL
## 3 AMNH ""Alachula Clays"" Mixon's Bone Bed, Levy Co., FL
## 4 AMNH ""Alachula Clays"" Mixon's Bone Bed, Levy Co., FL
## 5 AMNH ""Alachula Clays"" Mixon's Bone Bed, Levy Co., FL
## 6 AMNH ""Alachula Clays"" Mixon's Bone Bed, Levy Co., FL
##                   Sp._       Genus   Genus..2.             Taxon Side
## 1 FLOR-121-2193 (1941) Aepycamelus Aepycamelus Aepycamelus major    R
## 2  FLOR 28 #489 (1940) Aepycamelus Aepycamelus Aepycamelus major    R
## 3 FLOR-152-2490 (1942) Aepycamelus Aepycamelus Aepycamelus major    R
## 4 FLOR-146-2450 (1942) Aepycamelus Aepycamelus Aepycamelus major    R
## 5  FLOR 30 #535 (1940) Aepycamelus Aepycamelus Aepycamelus major    R
## 6    FLOR 2 #31 (1940) Aepycamelus Aepycamelus Aepycamelus major    R
##       LM    TD    TI    TP     LL    WD    WI    LI
## 1  95.13 39.90 54.34 45.32 105.67 69.38 74.50 80.24
## 2  95.25 38.22 54.65 45.25 105.20 69.73 75.04 80.36
## 3 100.35 39.63 54.86 44.27 109.10 69.83 72.23 83.84
## 4  98.09 39.90 53.26 42.44 105.57 73.83 76.32 82.74
## 5  94.10 40.75 51.95 42.16 101.15 67.47 68.25 80.41
## 6  90.31 34.36 47.68 40.34 100.38 67.74 70.37 78.14
##                                       Notes Prin.Comp.1 Prin.Comp.2
## 1                                              4.726898  0.08269429
## 2 Two right and two left with Same Field No    4.664651 -0.16147633
## 3                                              4.882082 -0.05180152
## 4                                              4.828685  0.02919387
## 5                    Two with Same Field No    4.232418  0.36355821
## 6                  Three with Same Field No    3.570367 -0.45557335
##   Prin.Comp.3
## 1  0.11249259
## 2  0.06565884
## 3  0.17052671
## 4 -0.32338462
## 5  0.14464094
## 6 -0.23328700

Dimensions on astragali measured in this study
Abbreviations: LM= medial length; LI= intermediate length; LL= lateral length; TD= distal thickness; TI= intermediate thickness; TP= proximal thickness; WD= distal width; WI= intermediate width

Linear Discriminant Function

Here we’re going to use the lda() function contained in the {MASS} package
Note: There are other packages that run linear discriminant analyses including:
{DiscriMiner}
{mda}
{dawai}
{rrlda}
{sparsediscrim}

Running the DFA

library(MASS)  #lda() and qda() can both be found in this package 
camel.lda <- lda(Genus..2. ~ LM + TD + TI + TP + LL + WD + WI + LI, data = camels, 
    prior = c(1, 1, 1, 1, 1)/5, na.action = "na.omit")  # Note that we're using + instead of * becuase we aren't modeling any variable interactions here
# Prior is specified to be uninformative; if not included, default prior is
# set by relative abundances in the training set.
camel.lda
## Call:
## lda(Genus..2. ~ LM + TD + TI + TP + LL + WD + WI + LI, data = camels, 
##     prior = c(1, 1, 1, 1, 1)/5, na.action = "na.omit")
## 
## Prior probabilities of groups:
##  Aepycamelus     Alforjas Hemiauchenia  Megatylopus   Procamelus 
##          0.2          0.2          0.2          0.2          0.2 
## 
## Group means:
##                    LM       TD       TI       TP       LL       WD
## Aepycamelus  92.03708 37.36500 49.18667 41.68917 98.84625 66.96438
## Alforjas     58.12042 27.79625 33.21125 28.46583 63.85958 42.73292
## Hemiauchenia 50.45350 23.21675 27.96225 24.00013 54.34913 35.70287
## Megatylopus  79.62783 33.81348 43.73391 37.09565 87.04217 60.14348
## Procamelus   66.68000 28.91333 35.94000 33.16167 73.38333 47.09333
##                    WI       LI
## Aepycamelus  68.94229 78.17250
## Alforjas     44.22375 49.25000
## Hemiauchenia 38.46363 41.53437
## Megatylopus  61.44304 68.53522
## Procamelus   49.16167 55.34667
## 
## Coefficients of linear discriminants:
##            LD1         LD2          LD3         LD4
## LM  0.17843479 -0.26154962  0.612728795 -0.22214235
## TD -0.08833182  0.06511872 -0.237187959 -0.37465125
## TI  0.03972081  0.28547071  0.005929533 -0.28948794
## TP -0.10287848 -0.52113821 -0.301629648 -0.04092927
## LL -0.03361356 -0.29001249 -0.180872844  0.30903264
## WD  0.08460694  0.20045633 -0.236866116  0.32094209
## WI -0.04463453  0.04310350  0.243654169  0.09190961
## LI  0.23894225  0.48890370 -0.251019220 -0.15398852
## 
## Proportion of trace:
##    LD1    LD2    LD3    LD4 
## 0.9431 0.0377 0.0149 0.0044

Displayed in this output

Call:

The formula that was fitted

Prior probabilities of groups:

The probability of randomly selecting a sample falling within each group within our dataset. If unspecified in our call, this defaults to the actual proportion of the total of each user-defined group within our data. If specified, these are whatever we indicated as our priors.

Group means:

Table of average values for each of our variables by group. Useful for determining if any of our groups seem distinctive or somewhat off.

Coefficients of linear discriminants:

Coefficients of the discriminant function. There will be # of groups - 1 linear discriminants. Here we have 5 groups so we’ll have 4 linear discriminants.

Proportion of trace:

The proportion of trace shows the proportion of variance between our groups explained by the linear discriminants.

Now we are interested in how well our model reassigns our samples to our groups.

The predict() function allows us to check how successfully the linear function we just derived classifies our data into groups.

Prediction

camel.p <- predict(camel.lda)
summary(camel.p)
##           Length Class  Mode   
## class     181    factor numeric
## posterior 905    -none- numeric
## x         724    -none- numeric
camel.p$class[1:5]
## [1] Aepycamelus Aepycamelus Aepycamelus Aepycamelus Aepycamelus
## Levels: Aepycamelus Alforjas Hemiauchenia Megatylopus Procamelus
camel.p$posterior[1:5, ]
##   Aepycamelus     Alforjas Hemiauchenia  Megatylopus   Procamelus
## 1   0.9999704 4.844915e-30 1.551687e-44 2.961259e-05 9.607793e-19
## 2   0.9999878 1.717864e-31 1.021650e-45 1.215902e-05 5.361055e-20
## 3   1.0000000 2.538645e-40 3.216243e-56 2.257359e-09 5.636456e-27
## 4   0.9999998 2.698264e-38 1.491667e-53 2.349458e-07 1.194647e-27
## 5   0.9999840 1.297946e-30 3.752229e-46 1.597350e-05 9.370914e-22
camel.p$x[1:5, ]
##        LD1        LD2        LD3          LD4
## 1 7.883166 -1.3481248 -0.1649644 -0.517977536
## 2 8.122471 -1.0755247  0.4335491 -0.003859773
## 3 9.851413 -1.2776851  1.2335674 -1.242550331
## 4 9.560707  1.2919976  1.2909380  0.434597623
## 5 8.164379  0.6828456 -0.3542590 -2.396851848

Class:

The first output here shows how our data are classified based on our LDA model

Posterior:

The posterior probabilities shows the probability of each sample being classified in one group over another. These probabilities measure the strength of each classification. If one is substantially greater than the others, that sample is assigned to that group with a high degree of confidence. Here, many of these probabilities are high so these data are confidently assigned to these groups.

x:

This last part shows the discriminant function axis scores for each sample. These scores can be plotted to show distribution of our samples.

Plotting

There are a number of ways to visualize these discriminant scores

A simple way to do this is to use the ggord() function in the {ggord} package to plot our model.
Why use this package over others?
1) Super easy to use.
2) A spin-off of ggplot and we all love ggplot so you might like this, too.

library(ggplot2)
library(devtools)
install_github("fawda123/ggord")
## Skipping install of 'ggord' from a github remote, the SHA1 (c92d629d) has not changed since last install.
##   Use `force = TRUE` to force installation
# check out https://github.com/fawda123/ggord/blob/master/R/ggord.R to view
# original code for this package
library(ggord)

At it’s base level, ggord() is pretty good.
Check our this blog post if you want to learn more about what ggord can do.

ggord(camel.lda, camels$Genus..2.)

Easy.
But our data points are little big, and it’s difficult to tell how our variables are influencing our data so we’ll mess around with this plot a little bit.
The ggord default function is a little particular so if you want to use this package to make your own plots, you should check out the raw code for the default ggord() plot to see the easiest way to customize the code.

p<-ggord(camel.lda, camels$Genus..2., txt=3, vec_ext=6, size=1 ) #adjusting text size, vector length, and size of the points
p+ ggtitle("DFA of Davis & McHorse 2013 Camelid data") + #adding a title
     theme(plot.title = element_text(lineheight=.8, face="bold"))+ #changing the title text 
  scale_color_brewer(palette="Dark2") + #changing the palette
     theme_minimal() #adding a minimal theme because I like it more 

How does this compare to a PCA plot?

library(FactoMineR)  #add the package with PCA() to our library
oo <- PCA(camels[, 8:15], graph = FALSE)  #running our PCA model
pca. <- ggord(oo, camels$Genus..2., txt = 3, vec_ext = 3, size = 1)  #plot our principal components as we did with our linear discriminants
pca. + ggtitle("PCA of Davis & McHorse Camelid data") + theme(plot.title = element_text(lineheight = 0.8, 
    face = "bold")) + scale_color_brewer(palette = "Dark2") + theme_minimal()

There is more overlap between groups in our PCA plot than in our DFA plot. This is to be expected - recall that PCA tries to retain variability in the data while DFA retains between group variance. Also note that PC1 explains slightly more variance among our data than LD1 does among our groups.

Another nice way to visualize this data is by using the {plotly} package.

Allows us to look at 3 discriminant scores at the same time
Also The plotly package rocks and has very cool interactive visualizations and an online interactive plot making tool

library(plotly)
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:MASS':
## 
##     select
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
p<-plot_ly(camels, x=camel.p$x[,1], y=camel.p$x[,2], z=camel.p$x[,3]) %>% #indicate what your axes are
  add_markers ( #add markers and a hover label
    text=~paste(paste("LD1:",camel.p$x[,1]), paste("LD2:",camel.p$x[,2]), paste("LD3:",camel.p$x[,3]) , paste("Genus:", camels$Genus..2.), sep="<br />"), #what each element in your hover label will be and how you will separate them (if sep= "<br />" wasn't here they would all print on one long hover label)
    color=~camels$Genus..2., colors="Set1",symbol=I("circle"), size=I(6), hoverinfo="text") %>% #how you will define points (by genus), what colors you will use, the symbol and symbol size of each point, and what what info will be shown in the hover label (in this case it is what we just "text" as) 
  layout(scene=list(xaxis=list(title="LD1"),
                    yaxis=list(title="LD2"),
                    zaxis=list(title="LD3"))) #axes labels
p

We have it plotted, what now?

Assessing our DFA

We are now interested in how accurately our discriminant function analysis classified our samples into our groups.
To assess this, we need to compare our predicted groups with our user-specified groups.

cam <- table(camels$Genus..2., camel.p$class)  #make a table of our original group assignments and predicted group assignments
cam  #here the rows are original group assignments and the columns are predicted group assignments
##               
##                Aepycamelus Alforjas Hemiauchenia Megatylopus Procamelus
##   Aepycamelus           44        0            0           4          0
##   Alforjas               0       23            1           0          0
##   Hemiauchenia           0        0           80           0          0
##   Megatylopus            1        0            0          22          0
##   Procamelus             0        0            0           0          6

From here we can quantify the overall predictive accuracy of our LDA model.

sum(cam[row(cam) == col(cam)])/sum(cam)  #summing the number of samples that were classified in the prediction as in our original data divided by the total number of samples
## [1] 0.9668508

Since we’ll be doing this again, it will be useful to have a function that generates a confusion matrix and calculates overall accuracy. I didn’t write the following function. I got it from here and only modified it slightly but I like the output it generates. It sure makes it easy to spit out information about the accuracy of your model (as we did in multiple lines of code above but now we can do it in one line).

# this is a function that gives us the overally accuracy of our model, our
# prior probabilities, and a table demonstrating what proportion of our
# groups were classified in their correct/incorrect groups.
library(lattice)
confusion <- function(actual, predicted, names = NULL, printit = TRUE, prior = NULL) {
    # names and priors are null unless otherwise specified; R should print our
    # output
    if (is.null(names)) 
        names <- levels(actual)
    tab <- table(actual, predicted)
    acctab <- t(apply(tab, 1, function(x) x/sum(x)))
    dimnames(acctab) <- list(Actual = names, Predicted = names)
    if (is.null(prior)) {
        relnum <- table(actual)
        prior <- relnum/sum(relnum)
        acc <- sum(tab[row(tab) == col(tab)])/sum(tab)
    } else {
        acc <- sum(prior * diag(acctab))
        names(prior) <- names
    }
    if (printit) 
        print(round(c(`Overall accuracy` = acc, `Prior frequency` = prior), 
            4))
    if (printit) {
        cat("\nConfusion matrix", "\n")
        print(round(acctab, 4))
    }
    invisible(acctab)
}
confusion(camels$Genus..2., camel.p$class, prior = c(1, 1, 1, 1, 1)/5)  #generating our confusion matrix for our camel lda model
##             Overall accuracy  Prior frequency.Aepycamelus 
##                       0.9663                       0.2000 
##     Prior frequency.Alforjas Prior frequency.Hemiauchenia 
##                       0.2000                       0.2000 
##  Prior frequency.Megatylopus   Prior frequency.Procamelus 
##                       0.2000                       0.2000 
## 
## Confusion matrix 
##               Predicted
## Actual         Aepycamelus Alforjas Hemiauchenia Megatylopus Procamelus
##   Aepycamelus       0.9167   0.0000       0.0000      0.0833          0
##   Alforjas          0.0000   0.9583       0.0417      0.0000          0
##   Hemiauchenia      0.0000   0.0000       1.0000      0.0000          0
##   Megatylopus       0.0435   0.0000       0.0000      0.9565          0
##   Procamelus        0.0000   0.0000       0.0000      0.0000          1

But what if we want to use jackknifing?

Jackknifing

Reminder: Jackknifed validation (or leave-out-one cross-validation) excludes one of our samples, generates a discriminant function with the remaining samples, uses this discriminant function to reclassify our excluded sample, and repeats this for each sample in our data set.
This can be done simply by indicating CV=TRUE in our lda() call.

knifed_camel <- lda(Genus..2. ~ LM + TD + TI + TP + LL + WD + WI + LI, data = camels, 
    prior = c(1, 1, 1, 1, 1)/5, na.action = "na.omit", CV = TRUE)  #CV=TRUE means we're going to jackknife the cuss out of this thing
summary(knifed_camel)
##           Length Class  Mode   
## class     181    factor numeric
## posterior 905    -none- numeric
## terms       3    terms  call   
## call        6    -none- call   
## xlevels     0    -none- list
knifed_camel$class[1:5]
## [1] Aepycamelus Aepycamelus Aepycamelus Aepycamelus Aepycamelus
## Levels: Aepycamelus Alforjas Hemiauchenia Megatylopus Procamelus
knifed_camel$posterior[1:5, ]
##   Aepycamelus     Alforjas Hemiauchenia  Megatylopus   Procamelus
## 1   0.9999649 6.559234e-30 2.044637e-44 3.510405e-05 1.533861e-18
## 2   0.9999865 1.601968e-31 9.423372e-46 1.354178e-05 6.903964e-20
## 3   1.0000000 3.111942e-42 9.947166e-59 7.555982e-10 4.875788e-28
## 4   0.9999998 2.351039e-39 6.745386e-55 1.811409e-07 1.417319e-28
## 5   0.9999825 1.638383e-30 3.921160e-46 1.745387e-05 9.073691e-22

Class:

Classification of each sample using our jackknifing method. Note: since the jackknifing method includes reclassifying our sample based on the model generated when that sample was removed, we don’t use the predict() function.

Posterior:

Posterior probabilities of each sample being categorized within each group. Again, the strength of the new classification is indicated by these probabilities.

Terms:

Basically a summary of the varaiables in your model.

Call:

The formula that was fitted.

One final step:

Compare the accuracy of our jackknifed model with our non-jackknifed model.

# Assess the accuracy of the prediction
confusion(camels$Genus..2., knifed_camel$class, prior = c(1, 1, 1, 1, 1)/5)
##             Overall accuracy  Prior frequency.Aepycamelus 
##                       0.9551                       0.2000 
##     Prior frequency.Alforjas Prior frequency.Hemiauchenia 
##                       0.2000                       0.2000 
##  Prior frequency.Megatylopus   Prior frequency.Procamelus 
##                       0.2000                       0.2000 
## 
## Confusion matrix 
##               Predicted
## Actual         Aepycamelus Alforjas Hemiauchenia Megatylopus Procamelus
##   Aepycamelus       0.9167   0.0000       0.0000      0.0833          0
##   Alforjas          0.0000   0.9583       0.0417      0.0000          0
##   Hemiauchenia      0.0000   0.0125       0.9875      0.0000          0
##   Megatylopus       0.0870   0.0000       0.0000      0.9130          0
##   Procamelus        0.0000   0.0000       0.0000      0.0000          1
confusion(camels$Genus..2., camel.p$class, prior = c(1, 1, 1, 1, 1)/5)  #look at our original model again
##             Overall accuracy  Prior frequency.Aepycamelus 
##                       0.9663                       0.2000 
##     Prior frequency.Alforjas Prior frequency.Hemiauchenia 
##                       0.2000                       0.2000 
##  Prior frequency.Megatylopus   Prior frequency.Procamelus 
##                       0.2000                       0.2000 
## 
## Confusion matrix 
##               Predicted
## Actual         Aepycamelus Alforjas Hemiauchenia Megatylopus Procamelus
##   Aepycamelus       0.9167   0.0000       0.0000      0.0833          0
##   Alforjas          0.0000   0.9583       0.0417      0.0000          0
##   Hemiauchenia      0.0000   0.0000       1.0000      0.0000          0
##   Megatylopus       0.0435   0.0000       0.0000      0.9565          0
##   Procamelus        0.0000   0.0000       0.0000      0.0000          1

Generally speaking, we could expect our jackknifed model to not be nearly as accurate as our non-jackknifed model.

IRIS DATA


Ronald Fisher
The use of multiple measurements in taxonomic problems
Fisher’s 1936 Paper

  • 1936 publication used iris morphological features to demonstrate the use of linear discriminant analysis
  • Sometimes called Anderson’s iris data set (he originally collected the data)
  • Collected “all from the same pasture, and picked on the same day and measured at the same time by the same person with the same apparatus”
  • Fisher developed a linear discriminant model to distinguish among the species
  • Iris data set contained in the {datasets} package in R
  • (iris) or (iris3) learn about the difference in format of these datasets here


Ronald Fisher thinking about discriminant analysis…

LDA vs QDA

Now with the iris data set we can repeat what we did with the camelids and run a linear discriminant analysis on our data.

ldafit <- lda(Species ~ ., iris, prior = rep(1, 3)/3)
irispredict <- predict(ldafit)
confusion(iris$Species, irispredict$class)
##           Overall accuracy     Prior frequency.setosa 
##                     0.9800                     0.3333 
## Prior frequency.versicolor  Prior frequency.virginica 
##                     0.3333                     0.3333 
## 
## Confusion matrix 
##             Predicted
## Actual       setosa versicolor virginica
##   setosa          1       0.00      0.00
##   versicolor      0       0.96      0.04
##   virginica       0       0.02      0.98

Quadratic Discriminant Analysis with 3 groups
Reminder: Quadratic Discriminant Analysis is used when we can’t assume homogeneity of our variance-covariance matrices. Again we’ll use the {Mass} package but this time use the qda() function.

qdafit <- qda(Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width, 
    iris, prior = rep(1, 3)/3)

Now we’ll use our confusion() function to get overall accuracy of our QDA model:

confusion(iris$Species, predict(qdafit)$class)
##           Overall accuracy     Prior frequency.setosa 
##                     0.9800                     0.3333 
## Prior frequency.versicolor  Prior frequency.virginica 
##                     0.3333                     0.3333 
## 
## Confusion matrix 
##             Predicted
## Actual       setosa versicolor virginica
##   setosa          1       0.00      0.00
##   versicolor      0       0.96      0.04
##   virginica       0       0.02      0.98

Some cool exploratory visualization tools

Partition plots

The {klaR} package allows us to look at how successful our classification methods are for every combination of 2 variables.

# Exploratory graph for LDA or QDA
library(klaR)
partimat(Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width, 
    data = iris, method = "lda")

Pairs plots

Similarly we can use the pairs() function in the basic graphics package to look at combinations of our variables.

pairs(iris[c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width")], 
    main = "Scatterplot matrix for iris data", pch = 22, bg = c("red", "yellow", 
        "blue")[unclass(iris$Species)])

3D Scatterplots

Again, if we want to look at 3 axes at the same time 3D plots are useful.
The {scatterplot3d} package has nice visualization of scatterplots.
While they are not interactive and therefore fabulous, they are probably better for print, i.e. boring.
Here I’ll demonstrate this with the Sepal/Petal Widths/Lengths although you could also do one of these plots to show your LD scores as we did above with the camelids.

library(scatterplot3d)
#plot the following 4 plots on the same page
par(mfrow = c(2, 2)) #number of rows
mar0 = c(2, 3, 2, 3) #dimensions for each plot
scatterplot3d(iris[, 1], iris[, 2], iris[, 3], mar = mar0, color = c("red", #specify which columns of the data set you want to plot, what the margins should be, and what color
"yellow", "blue")[iris$Species], pch = 19) #what you should separate colors by (species) and the size of the points
scatterplot3d(iris[, 2], iris[, 3], iris[, 4], mar = mar0, color = c("red",
"yellow", "blue")[iris$Species], pch = 19)
scatterplot3d(iris[, 3], iris[, 4], iris[, 1], mar = mar0, color = c("red",
"yellow", "blue")[iris$Species], pch = 19)
scatterplot3d(iris[, 4], iris[, 1], iris[, 2], mar = mar0, color = c("red",
"yellow", "blue")[iris$Species], pch = 19)

Interactive 2D plots

Finally, because javascript is terrific, here’s a {plotly} plot of our LD scores for the iris data set:

p <- plot_ly(data = iris, x = ~irispredict$x[, 1], y = ~irispredict$x[, 2], 
    color = ~Species, text = ~paste("Species:", Species, "LDA1:", irispredict$x[, 
        1], "LDA2:", irispredict$x[, 2], sep = "<br />"), hoverinfo = "text")
p
## No trace type specified:
##   Based on info supplied, a 'scatter' trace seems appropriate.
##   Read more about this trace type -> https://plot.ly/r/reference/#scatter
## No scatter mode specifed:
##   Setting the mode to markers
##   Read more about this attribute -> https://plot.ly/r/reference/#scatter-mode

On these 2D plots you can use your mouse to zoom in on certain parts of your plot and double click to zoom back out. COOL!

REFERENCES

  1. Alan Fielding
  2. SFSU
  3. R-Bloggers
  4. This nice vignette
  5. Berkeley
  6. Cliff N.1987. Analyzing multivariate data. San Diego, CA: Harcourt Brace Jovanovich.
  7. Discriminant Analysis Tutorial